Abstract

Diabetes, also called the silent killer, is a metabolic disease
characterized by high blood glucose levels, which result
when the body does not produce enough insulin or is
resistant to its effects. Classification models are
among the most widely used data mining tools and
greatly help physicians improve their prognosis,
diagnosis, and treatment planning procedures. Classification
accuracy is one of the most important criteria for
choosing an appropriate classification model; hence,
research aimed at improving the effectiveness of
these models has never stopped. Despite the
numerous classification models proposed over the past several
decades, it is widely recognized that diabetes is extremely
difficult to classify. In this paper, a hybrid binary
classification model is proposed for type II diabetes
classification, based on the basic concepts of soft computing
and artificial intelligence techniques. Empirical results on
the Pima Indian diabetes data set indicate that the hybrid
model is generally more accurate than other linear/nonlinear,
soft/hard, and classic/intelligent classification models
presented for diabetes classification. Therefore, the
proposed model may be a suitable alternative for
medical classification to achieve greater accuracy and to
improve medical diagnosis.
Introduction

Diabetes mellitus has become a common chronic
disease that affects between 2% and 4% of the global
population, and its prevention and effective treatment
are undoubtedly crucial public health and health
economics issues in the 21st century. Diabetes, sometimes
called the silent killer, is a metabolic disease
characterized by high blood glucose levels, which result
when the body does not produce enough insulin or is
resistant to its effects. The body needs
insulin to use sugar, fat and protein from the diet for
energy. Diabetes is associated with many
complications and can increase the risk of blindness,
high blood pressure, heart disease, kidney disease, and
nerve damage (Temurtas et al., 2009).

Diabetes is generally divided into two
categories, type I and type II. The
most common form is type II diabetes, in which the body is
resistant to the effects of insulin. Millions of people
have been diagnosed with type II diabetes and,
unfortunately, many more are unaware that they are at
high risk. Although recent medical progress has improved
early diagnosis, about half of
the patients with type II diabetes are unaware of their
disease, and the delay from disease onset to diagnosis
may exceed ten years, even though early diagnosis
and treatment of this disease are vital.

Classification systems have been widely used in the
medical domain to explore patient data and extract
predictive models. They help physicians improve
their prognosis, diagnosis, and treatment planning
procedures. In recent years, many studies have
addressed the diagnosis of diabetes.
Several different methods, such as logistic
regression, Naive Bayes, Semi-Naive Bayes, multi-layer
perceptrons (MLPs), radial basis functions
(RBFs), general regression neural networks (GRNNs),
support vector machines (SVMs), and least squares support
vector machines (LS-SVMs), have been used in some of
these studies (Charya et al., 2010; Kayaer & Yildirim,
2003; Bennett & Blue, 1998; Friedman et al., 1997).
Decision tree techniques have also been widely used
to build classification models, as such models closely
resemble human reasoning and are easy to understand.

Ton et al. (2006) constructed a classification model for
type II diabetes using anthropometrical body surface
scanning data. They applied four data mining
approaches, namely artificial neural networks,
decision trees, logistic regression, and rough sets, to
select the relevant features from the data to classify
diabetes. They showed that the accuracy of the
decision tree and rough set approaches was superior
to that of logistic regression and the neural network.
Joseph et al. (2002) used classification and regression
trees with a binary target
and ten attributes: age, sex, emergency
department visits, office visits, comorbidity index,
dyslipidemia, hypertension, cardiovascular disease,
retinopathy, and end-stage renal disease.

Polat et al. (2008) proposed a new cascade learning
system based on generalized discriminant analysis
and least squares support vector machines for the
classification of diabetes. They examined the
robustness of their system using
classification accuracy, k-fold cross-validation,
and the confusion matrix. Huang et al. (2007) first applied
feature selection methods to discover key
attributes affecting diabetic control, and then used
three complementary classification techniques,
Naive Bayes, IB1, and C4.5, to classify how
well the patients' condition was controlled. Hung et al.
(2012) proposed a system that uses a supervised
classifier to screen the important risk factors for different
chronic illnesses and then uses these significant risk
factors to perform the classification and to construct
early-warning criteria.

Calisir and Dogantekin (2011) introduced an
automatic diagnosis system that integrates linear
discriminant analysis (LDA) and a Morlet wavelet
support vector machine (LDA-MWSVM) for diabetes
classification. Patil et al. (2010) built a hybrid
classification model that could accurately classify
newly diagnosed patients (pregnant women) into
either a group that is likely to develop diabetes or a
group that is not, within five
years of first diagnosis. Zhao (2007)
proposed a multi-objective genetic programming
approach to developing Pareto optimal decision trees
and illustrated its application to diabetes
classification.

Recently, fuzzy approaches have become one of the
well-known solutions for improving classification
models. Fuzzy theory was originally developed to deal
with problems involving linguistic terms (Zadeh,
1975a) and has been successfully applied to a broad
range of problems. Fuzzy logic (Zadeh, 1975b)
improves classification and decision support systems
(DSS) by allowing overlapping class
definitions and through its powerful capabilities for
handling uncertainty and vagueness (Shi et al., 1999).


Ganji & Abadeh (2011) proposed an ant colony-based
classification system, named FCS-ANTMINER, to extract
a set of fuzzy rules for the diagnosis of diabetes.
Khashei et al. (2012) proposed a new hybrid model
combining artificial intelligence with fuzzy models in
order to benefit from the unique advantages of both
techniques and construct an efficient and accurate
hybrid classifier. Kahramanli and Allahverdi (2008)
presented a new method for classifying data from a
medical database and developed a hybrid neural
network that includes artificial neural networks and
fuzzy neural networks (FNNs).

In this paper, a two-stage hybrid classification model
based on traditional multi-layer perceptrons is proposed in
order to yield more accurate results than other
models in type II diabetes classification. In the first
stage of the proposed model, a multi-layer perceptron is
used to pre-process the raw data and provide the necessary
background for applying a fuzzy regression
model. In the second stage, the parameters obtained in the
first stage are considered in the form of fuzzy numbers,
and the optimum values of the proposed model's
parameters are then calculated using the basic concepts of
fuzzy regression. In order to show the effectiveness
and appropriateness of the proposed model, its
performance is compared with that of some fuzzy
and nonfuzzy, linear and nonlinear, and intelligent
classification models. Empirical results on the Pima Indian
diabetes data set indicate that the proposed
model is an effective way to improve
classification accuracy.

The rest of the paper is organized as follows. In the
next section, the basic concepts and modelling
approaches of traditional multi-layer perceptrons
(MLPs) and the other classification models used in this
paper are briefly reviewed. The formulation of the
proposed hybrid model for classification tasks is
presented in Section 3. In Section 4, the
Pima Indian diabetes data set is briefly introduced. In
Section 5, the proposed model is applied to the
classification of the Pima Indian diabetes data set. In
Section 6, the performance of the proposed model is
compared with that of some other classification models
presented in the literature for diabetes classification.
Finally, the conclusions are discussed.

Classification Approaches 

In this section, the basic concepts and modelling
approaches of the multi-layer perceptron (MLP),
support vector machine (SVM), K-nearest neighbour
(KNN), quadratic discriminant analysis (QDA), and
linear discriminant analysis (LDA) models for
classification are briefly reviewed.

Multi-Layer Perceptrons (MLPs) 

Artificial neural networks (ANNs) are computer 
systems developed to mimic the operations of the 
human brain by mathematically modelling its neuro- 
physiological structure. Artificial neural networks 
have been shown to be effective at approximating 
complex nonlinear functions (Zhang, 2001). For 
classification tasks, these functions represent the shape 
of the partition between classes. In artificial neural 
networks, computational units called neurons replace 
the nerve cells and the strengths of the 
interconnections are represented by weights, in which 
the learned information is stored. This unique 
arrangement can acquire some of the neurological 
processing ability of the biological brain such as 
learning and drawing conclusions from experience. 
Artificial neural networks combine the flexibility of
the boundary shape found in K-nearest neighbour
with the efficiency and low storage requirements of
discriminant functions. Like K-nearest neighbour,
artificial neural networks are data driven; there are no
assumed model characteristics or distributions, as is
the case with discriminant analysis (Berardi & Zhang,
1999).

Multi-layer perceptrons (MLPs) are one of the most 
important and widely used forms of artificial neural 
networks for modelling, forecasting, and classification 
(Silva, 2008). These models are characterized by a
network of three layers of simple processing units
connected by acyclic links (Fig. 1). The relationship
between the output ($y$) and the inputs ($x_1, x_2, \ldots, x_p$)
has the following mathematical representation:

$$y_t = w_0 + \sum_{j=1}^{q} w_j \cdot g\left(w_{0,j} + \sum_{i=1}^{p} w_{i,j} \cdot x_{t,i}\right) + \varepsilon_t, \qquad (1)$$

where $w_{i,j}$ ($i = 0, 1, 2, \ldots, p$; $j = 1, 2, \ldots, q$) and
$w_j$ ($j = 0, 1, 2, \ldots, q$) are model parameters, often
called connection weights; $g$ is the hidden transfer
function; $\varepsilon_t$ is the white noise at time $t$; $p$ is
the number of input nodes; and $q$ is the
number of hidden nodes. Data enter the network
through the input layer, move through the hidden layer,
and exit through the output layer. Each hidden layer
and output layer node collects data from the nodes
in the preceding layer (either the input layer or the
hidden layer) and applies an activation function.
Activation functions can take several forms. The type of
activation function is determined by the position of the
neuron within the network. In the majority of cases, input
layer neurons do not have an activation function, as their
role is to transfer the inputs to the hidden layer. The
logistic and hyperbolic tangent functions, shown in Eq. 2
and Eq. 3 respectively, are often used as hidden
layer and output transfer functions for classification
problems. Other transfer functions, such as linear and
quadratic, can also be used, each with a variety of
modelling applications.


$$\mathrm{Sig}(x) = \frac{1}{1 + \exp(-x)} \qquad (2)$$

$$\mathrm{Tanh}(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)} \qquad (3)$$
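
As an illustration only, the network output of Eq. (1) without
the noise term, together with the two transfer functions, can be
sketched in Python. The function names, the array layout, and the
example weights are assumptions made for this sketch and are not
taken from the paper.

```python
import numpy as np

def sig(x):
    """Logistic transfer function, Eq. (2)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent transfer function, Eq. (3)."""
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

def mlp_output(x, w0, w, w0j, wij, g=sig):
    """Noise-free part of Eq. (1) for one input vector x of length p.

    w0  : scalar output bias w_0
    w   : shape (q,)   hidden-to-output weights w_j
    w0j : shape (q,)   hidden biases w_{0,j}
    wij : shape (p, q) input-to-hidden weights w_{i,j}
    """
    hidden = g(w0j + x @ wij)   # activations of the q hidden nodes
    return w0 + hidden @ w

# Hypothetical network with p = 2 inputs and q = 3 hidden nodes.
rng = np.random.default_rng(0)
p, q = 2, 3
x = np.array([0.5, -1.0])
y = mlp_output(x, 0.1, rng.normal(size=q), rng.normal(size=q),
               rng.normal(size=(p, q)))
```

Passing `g=tanh` switches the hidden transfer function without
changing the rest of the computation.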


The simple network given by (1) is surprisingly
powerful in that it is able to approximate an arbitrary
function when the number of hidden nodes q is
sufficiently large. In practice, a simple network structure
with a small number of hidden nodes often works
well in out-of-sample forecasting. This may be due to
the over-fitting effect typically found in the neural
network modelling process. An over-fitted model has
a good fit to the sample used for model building but
has poor generalizability to data out of the sample.



FIG. 1 MULTI-LAYER PERCEPTRON STRUCTURE ($N^{(p-q-1)}$)

There exist many different approaches for finding the
optimal architecture of an artificial neural network,
such as the pruning algorithm, the polynomial time
algorithm, the canonical decomposition technique, and
the network information criterion. These
approaches can be generally categorized as follows
(Khashei & Bijari, 2010): (i) Empirical or statistical
methods that are used to study the effect of internal
parameters and choose appropriate values for them
based on the performance of the model. The most
systematic and general of these methods utilizes the
principles of Taguchi's design of experiments. (ii)
Hybrid methods such as fuzzy inference, where the
artificial neural network can be interpreted as an
adaptive fuzzy system or can operate on fuzzy
instead of real numbers. (iii) Constructive and/or
pruning algorithms that, respectively, add and/or
remove neurons from an initial architecture using a
previously specified criterion to indicate how the
performance of the artificial neural network is affected by
the changes. The basic rules are that neurons are added
when training is slow or when the mean squared error is
larger than a specified value. Conversely, neurons are
removed when a change in a neuron's value does not
correspond to a change in the network's response or
when the weight values associated with this
neuron remain constant for a large number of training
epochs. (iv) Evolutionary strategies that search over the
topology space by varying the number of hidden
layers and hidden neurons through the application of
genetic operators and the evaluation of the different
architectures according to an objective function
(Benardos et al., 2007).


Although many different approaches exist for
finding the optimal architecture of an artificial neural
network, these methods are usually quite complex in
nature and difficult to implement (Zhang &
Patuwo, 1998). Furthermore, none of these methods
can guarantee the optimal solution for all real
forecasting problems. To date, there is no simple
clear-cut method for determining these parameters, and
the usual procedure is to test numerous networks with
varying numbers of hidden units, estimate the
generalization error for each, and select the network
with the lowest generalization error (Hosseini et al.,
2006). Once a network structure is specified, the
network is ready for training, a process of parameter
estimation. The parameters are estimated such that the
cost function of the neural network is minimized. The cost
function is an overall accuracy criterion such as the
following mean squared error:


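$$E = \frac{1}{N}\sum_{t=1}^{N} \varepsilon_t^{2} = \frac{1}{N}\sum_{t=1}^{N}\left(y_t - \left(w_0 + \sum_{j=1}^{q} w_j \cdot g\left(w_{0,j} + \sum_{i=1}^{p} w_{i,j} \cdot x_{t,i}\right)\right)\right)^{2} \qquad (4)$$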


where $N$ is the number of error terms. This
minimization is done with some efficient nonlinear
optimization algorithms other than the basic
backpropagation training algorithm (Rumelhart &
McClelland, 1986), in which the parameters of the
neural network, $w_{i,j}$, are changed by an amount
$\Delta w_{i,j}$ according to the following formula:


$$\Delta w_{i,j} = -\eta \frac{\partial E}{\partial w_{i,j}} \qquad (5)$$


where the parameter $\eta$ is the learning rate and
$\partial E / \partial w_{i,j}$ is the partial derivative of
the function $E$ with respect to the weight $w_{i,j}$. This
derivative is commonly computed in two passes. In the forward
pass, an input vector from the training set is applied to
the input units of the network and is propagated
through the network, layer by layer, producing the
final output. During the backward pass, the output of
the network is compared with the desired output and
the resulting error is then propagated backward
through the network, adjusting the weights
accordingly. To speed up the learning process while
avoiding instability of the algorithm, Rumelhart
and McClelland (1986) introduced a momentum term
$\delta$ in Eq. (5), thus obtaining the following learning
rule:

$$\Delta w_{i,j}(t+1) = -\eta \frac{\partial E}{\partial w_{i,j}} + \delta \, \Delta w_{i,j}(t). \qquad (6)$$



The momentum term may also help prevent
the learning process from being trapped in poor
local minima, and it is usually chosen in the interval
[0, 1]. Finally, the estimated model is evaluated using a
separate hold-out sample that is not exposed to the
training process.
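
The gradient update with a momentum term described above can be
sketched as follows. This is a minimal illustration, not the
training procedure used in the paper: the toy quadratic cost, the
function names, and the values of the learning rate and momentum
are assumptions chosen only to make the example runnable.

```python
import numpy as np

def momentum_step(w, grad, prev_delta, eta=0.1, delta=0.9):
    """One weight update: dw(t+1) = -eta * dE/dw + delta * dw(t)."""
    step = -eta * grad + delta * prev_delta
    return w + step, step

def train(grad_fn, w_init, n_epochs=200, eta=0.1, delta=0.9):
    """Repeat the momentum update for a fixed number of epochs."""
    w = np.asarray(w_init, dtype=float)
    prev = np.zeros_like(w)          # no previous step at the start
    for _ in range(n_epochs):
        w, prev = momentum_step(w, grad_fn(w), prev, eta, delta)
    return w

# Toy cost E(w) = ||w - target||^2, standing in for the network's
# mean squared error (an assumption of this sketch).
target = np.array([1.0, -2.0])
w_hat = train(lambda w: 2.0 * (w - target), w_init=[0.0, 0.0])
```

Setting `delta=0.0` recovers the plain gradient step without
momentum.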

Linear Discriminant Analysis (LDA) 

Linear discriminant analysis (LDA) is a very simple
and effective supervised classification method with
wide applications. The basic idea of linear
discriminant analysis is to classify observations by
dividing an n-dimensional descriptor space into two
regions separated by a hyper-plane that is
defined by a linear discriminant function.
Discriminant analysis generally transforms
classification problems into functions that partition
data into classes, thus reducing the problem to the
identification of a function. The focus of discriminant
analysis is on determining this functional form and
estimating its coefficients. In linear discriminant
analysis, this function is assumed to be linear. Ronald
Aylmer Fisher (1936) first introduced the linear
discriminant function. Fisher's linear discriminant
function works by finding the mean of the set of
attributes for each class and using the mean of these
means as the boundary. The function achieves this by
projecting the attribute points onto the vector that
maximally separates their class means and minimizes
their within-class variance. Fisher's linear
discriminant function can be written as follows:


$$X' S^{-1}(\bar{X}_2 - \bar{X}_1) - \frac{1}{2}(\bar{X}_2 + \bar{X}_1)' S^{-1}(\bar{X}_2 - \bar{X}_1) > c \qquad (7)$$


where $X$ is the vector of the observed values,
$\bar{X}_i$ ($i = 1, 2$) is the mean of values for each group, $S$ is
the sample covariance matrix of all variables, and $c$ is
the cost function. If the misclassification cost of each
group is considered equal, $c$ is set to zero. A member is
classified into one group if the result of the equation is
greater than $c$ (or zero) and into the other if it is less
than $c$ (or zero). A result equal to $c$ indicates that a sample
cannot be classified into either class based on the
features used in the analysis.
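
A minimal sketch of this two-group rule is given below. It
assumes that the covariance matrix is the pooled within-group
sample covariance and that a score above the threshold assigns
the observation to the second group; both choices, as well as the
function names, are assumptions of the sketch.

```python
import numpy as np

def fisher_lda_score(x, X1, X2):
    """Linear discriminant score of observation x.

    X1, X2 : arrays of shape (n1, d) and (n2, d) holding the
             training samples of group 1 and group 2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled sample covariance matrix (assumed form of S).
    S = np.atleast_2d(
        ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    )
    a = np.linalg.solve(S, m2 - m1)      # S^{-1} (mean_2 - mean_1)
    return x @ a - 0.5 * (m1 + m2) @ a

def lda_classify(x, X1, X2, c=0.0):
    """Return 2 if the score exceeds c, otherwise 1."""
    return 2 if fisher_lda_score(x, X1, X2) > c else 1

# Hypothetical training groups centred near (0.5, 0.5) and (4.5, 4.5).
X1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X2 = X1 + 4.0
label = lda_classify(np.array([4.1, 3.9]), X1, X2)
```

With equal misclassification costs (`c=0.0`), the decision
boundary passes through the midpoint of the two group means.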

The linear discriminant function distinguishes 
between two classes. If a data set has more than two 
classes, the process must be broken down into 
multiple two-class problems. The linear discriminant
function is found for each class versus all samples
that are not of that class (one-versus-all). Final class
membership for each sample is determined by the
linear discriminant function that produces the highest
value. Linear discriminant analysis is optimal when
the variables are normally distributed with equal 
covariance matrices. In this case, the linear 
discriminant function is in the same direction as the 
Bayes optimal classifier (Billings & Lee, 2002). The 
linear discriminant is known to perform well on 
moderate sample sizes when compared to more 
complex methods (Ghiassi & Burnley, 2010). As a 
straightforward mathematical function, requiring 
nothing more complicated than matrix arithmetic, the 
linear discriminant is relatively simple to perform. The 
assumption of linearity in the class boundary, 
however, limits the scope of application for linear 
discriminant analysis. Real-world data frequently 
cannot be separated by a linear boundary. When 
boundaries are nonlinear, the performance of the 
linear discriminant may be inferior to other 
classification methods. 


Quadratic Discriminant Analysis (QDA) 

Quadratic discriminant analysis (QDA), first
introduced by Smith (1947), is another distance-based
classifier, which is very similar to the linear
discriminant function classifier. In fact, quadratic
discriminant analysis is an extension of the linear
discriminant function. Both discriminant functions
assume that the values of each attribute in each class
are normally distributed; however, in quadratic
discriminant analysis the discriminant
score between each sample and each class is calculated
using the sample variance-covariance matrix of each
class separately rather than the overall pooled matrix,
so the method takes into account the different
variance of each class.

In linear discriminant analysis it is
assumed that the covariance matrices of the groups are
equal, whereas quadratic discriminant analysis makes
no such assumption. When the covariance matrices are
not equal, the boundary between the classes will be a
hyper-conic and, in theory, the use of quadratic
discriminant analysis will result in better
discrimination and classification rates. However, due
to the larger number of parameters that
need to be estimated, it is quite possible that
classification by quadratic discriminant analysis is
worse than that by linear discriminant analysis
(Malhotra et al., 1999). The quadratic discriminant is
found by evaluating the equation:


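$$X'\left(S_2^{-1} - S_1^{-1}\right)X + 2\left(\bar{X}_1' S_1^{-1} - \bar{X}_2' S_2^{-1}\right)X + \bar{X}_2' S_2^{-1} \bar{X}_2 - \bar{X}_1' S_1^{-1} \bar{X}_1 + \ln\frac{|S_2|}{|S_1|} > c \qquad (8)$$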


The same conditions apply to the nature of c and to
classification in the case that the result is equal to c or
zero. As with the linear discriminant, the quadratic
discriminant function distinguishes between two
classes. Multiple-class data sets are handled
in the same way as for linear discriminant analysis. The size
of the differences in variances determines how much
better the quadratic discriminant function will
perform than the linear discriminant. For large
variance differences, the quadratic discriminant excels
when compared to the linear discriminant.
Additionally, of the two, only the quadratic
discriminant can be used when population means are
equal. Although more broadly applicable than the
linear discriminant, the quadratic discriminant is less
resilient under non-optimal conditions. The quadratic
discriminant can behave worse than the linear
discriminant for small sample sizes. Additionally, data
that are not normally distributed result in poorer
performance by the quadratic discriminant than by
the linear discriminant.
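
The quadratic rule can be illustrated with a short sketch that
uses a separate covariance matrix for each group. The score below
is written as the difference of squared Mahalanobis distances
plus a log-determinant term, which is the usual expanded form
under the normality assumption; the convention that a score above
the threshold means the first group, and all names in the code,
are assumptions of this sketch.

```python
import numpy as np

def quadratic_score(x, X1, X2):
    """Quadratic discriminant score of x for two training groups.

    A positive score means x is closer, in the Gaussian sense, to
    group 1 than to group 2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.atleast_2d(np.cov(X1, rowvar=False))   # covariance of group 1
    S2 = np.atleast_2d(np.cov(X2, rowvar=False))   # covariance of group 2
    d1 = (x - m1) @ np.linalg.solve(S1, x - m1)    # squared Mahalanobis distance
    d2 = (x - m2) @ np.linalg.solve(S2, x - m2)
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return d2 - d1 + logdet2 - logdet1

def qda_classify(x, X1, X2, c=0.0):
    """Return 1 if the score exceeds c, otherwise 2."""
    return 1 if quadratic_score(x, X1, X2) > c else 2

# Hypothetical groups with equal means but very different spread,
# a case the linear discriminant cannot separate.
X1 = np.array([[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1]])
X2 = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0], [0.0, -3.0]])
near = qda_classify(np.array([0.0, 0.0]), X1, X2)
far = qda_classify(np.array([2.0, 2.0]), X1, X2)
```

In this example the point at the origin is assigned to the tight
group and the distant point to the widely spread group, even
though the two group means coincide.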

Marks and Dunn (1974) found the performance of the 
quadratic discriminant function to be more sensitive to 
the dimensions of the data than the linear 
discriminant, improving as the number of attributes 
increases to a certain optimal number, then rapidly 
declining. Linear and nonlinear discriminant functions 
are the most widely used classification methods. This 
broad acceptance is due to their ease of use and the 
wide availability of tools. Both, however, assume the 
form of the class boundary is known and fits a specific 
shape. This shape is assumed to be smooth and 
described by a known function. These assumptions 
may fail in many cases. In order to perform 
classification for a wider range of real-world data, a 
method must be able to describe boundaries of 
unknown, and possibly discontinuous, shapes. 

K-Nearest Neighbour (KNN) 

The K-nearest neighbour (KNN) model is a well-known
supervised learning algorithm for pattern
recognition that was first introduced by Fix and Hodges in
1951 and is still one of the most popular
nonparametric models for classification problems (Fix
& Hodges, 1951; 1952). K-nearest neighbour assumes
that observations that are close together are likely
to have the same classification. The probability that a
point x belongs to a class can be estimated by the 
proportion of training points in a specified 
neighbourhood of x that belong to that class. The point 
may either be classified by majority vote or by a 
similarity degree sum of the specified number ( k ) of 
nearest points. In majority voting, the number of 
points in the neighbourhood belonging to each class is 
counted, and the class to which the highest proportion 
of points belongs is the most likely classification of x. 
The similarity degree sum calculates a similarity score 
for each class based on the K-nearest points and 
classifies x into the class with the highest similarity 
score. Due to its lower sensitivity to outliers, majority 
voting is more commonly used than the similarity 
degree sum (Chaovalitwongse, 2007). In this paper, 
majority voting is used for the data sets. 

In order to determine which points belong in the
neighbourhood, the distances from x to all points in
the training set must be calculated. Any distance
function that specifies which of two points is closer to
the sample point could be employed (Fix & Hodges,
1951). The most common distance metric used in
K-nearest neighbour is the Euclidean distance (Viaene,
2002). The Euclidean distance between each test point
$f_t$ and training set point $f_s$, each with $n$ attributes, is
calculated using the equation:

$$d = \left((f_{t1} - f_{s1})^2 + (f_{t2} - f_{s2})^2 + \cdots + (f_{tn} - f_{sn})^2\right)^{1/2} \qquad (9)$$

In general, the following steps are performed for the
K-nearest neighbour model (Yildiz et al., 2008):

i) Choose the value of k.

ii) Calculate the distances.

iii) Sort the distances in ascending order.

iv) Find the class values of the k nearest points.

v) Find the dominant class.
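
The steps above can be sketched in a few lines of Python with
Euclidean distance and majority voting. The function name, the
default value of k, and the toy data are assumptions made for
illustration.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Step ii: Euclidean distance from x to every training point.
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Step iii: sort the distances in ascending order, keep the k nearest.
    nearest = np.argsort(dists)[:k]
    # Steps iv and v: collect the k class values and pick the dominant one.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set with two well-separated classes.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
label = knn_classify(np.array([0.5, 0.5]), X_train, y_train, k=3)
```

Every call computes the distance to all training points, which is
why the method becomes slow for large training sets.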

One challenge in using K-nearest neighbour is to
determine the optimal size of k, which acts as a
smoothing parameter. A small k will not be sufficient
to accurately estimate the population proportions
around the test point. A larger k will result in less
variance in the probability estimates but at the risk of
introducing more bias. k should be large enough to
minimize the probability of a non-Bayes decision, but
small enough that the points included give an accurate
estimate of the true class. Enas and Choi (1986) found
that the optimal value of k depends upon the sample
size and covariance structures in each population, as
well as the proportions for each population in the total
sample. For cases in which the differences in the
covariance matrices and the difference between
sample proportions were either both small or both
large, Enas and Choi (1986) found the optimal k to
be $N^{3/8}$, where $N$ is the number of samples in the
training set. When there was a large difference
between covariance matrices and a small difference
between sample proportions, or vice versa, they
determined $N^{2/8}$ to be the optimal value of k. This
model presents several advantages (Berrueta et al.,
2007):

(i) Its mathematical simplicity, which does not 
prevent it from achieving classification results as 
good as (or even better than) other more complex 
pattern recognition techniques. 
(ii) It is free from statistical assumptions, such as the
normal distribution of the variables. 

(iii) Its effectiveness does not depend on the space 
distribution of the classes. 

In addition, when the boundaries between classes
cannot be described as hyper-linear or hyper-conic,
K-nearest neighbour performs better than the linear and
quadratic discriminant functions. Enas and Choi
(1986) found that the linear discriminant performs
slightly better than K-nearest neighbour when
population covariance matrices are equal, a condition
that suggests a linear boundary. As the differences in
the covariance matrices increase, K-nearest neighbour
performs increasingly better than the linear
discriminant function.

However, despite all the advantages cited for
K-nearest neighbour models, they also have some
disadvantages. The K-nearest neighbour model cannot
work well if large differences are present in the
number of samples in each class. K-nearest neighbour
provides poor information about the structure of the
classes and about the relative importance of each variable
in the classification. Furthermore, it does not allow a
graphical representation of the results, and in the case
of a large number of samples, the computation can
become excessively slow. In addition, the K-nearest
neighbour model has much higher memory and
processing requirements than other methods. All
prototypes in the training set must be stored in
memory and used to calculate the Euclidean distance
from every test sample. The computational cost
grows as the number of prototypes
increases (Muezzinoglu & Zurada, 2006).

Support Vector Machines (SVMs) 

Support vector machines (SVMs) are a new pattern
recognition tool theoretically founded on Vapnik's
statistical learning theory (Vapnik, 1998). Support
vector machines, originally designed for binary
classification, employ supervised learning to find the
optimal separating hyper-plane between the two
groups of data. Having found such a plane, support
vector machines can then predict the classification of
an unlabeled example by asking on which side of the
separating plane the example lies. A support vector
machine acts as a linear classifier in a high
dimensional feature space obtained by a projection of
the original input space; the resulting classifier is in
general non-linear in the input space, and it achieves
good generalization performance by maximizing the
margin between the two classes. In the following, we
give a short outline of the construction of a support
vector machine.

Consider a set of training examples as follows:

$$\{(x_i, y_i)\}, \quad x_i \in R^n, \; y_i \in \{+1, -1\}, \; i = 1, 2, \ldots, m \qquad (10)$$

where the $x_i$ are real n-dimensional pattern vectors
and the $y_i$ are dichotomous labels. The support vector
machine maps the pattern vectors $x \in R^n$ into a
possibly higher dimensional feature space ($z = \phi(x)$)
and constructs an optimal hyper-plane $w \cdot z + b = 0$ in
feature space to separate examples from the two
classes. For the support vector machine with L1
soft-margin formulation, this is done by solving the primal
optimization problem as follows:

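$$\begin{aligned} \text{Min} \quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \\ \text{s.t.} \quad & y_i(w \cdot z_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, m \end{aligned} \qquad (11)$$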

where $C$ is a regularization parameter used to decide a
trade-off between the training error and the margin,
and $\xi_i$ ($i = 1, 2, \ldots, m$) are slack variables. The above
problem is computationally solved using the solution
of its dual form:

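$$\begin{aligned} \text{Max} \quad & \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i \alpha_j y_i y_j k(x_i, x_j) \\ \text{s.t.} \quad & \sum_{i=1}^{m}\alpha_i y_i = 0; \quad 0 \le \alpha_i \le C, \; i = 1, 2, \ldots, m \end{aligned} \qquad (12)$$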

where $k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is the kernel function that
implicitly defines a mapping $\phi$. The resulting decision
function is:

$$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{m}\alpha_i y_i k(x_i, x) + b\right) \qquad (13)$$

All kernel functions have to fulfil Mercer's theorem;
the most commonly used kernel functions
are the polynomial kernel and the radial basis function
kernel, given in Eq. (14) and Eq. (15) respectively
(Song & Tang, 2005).

$$k(x_i, x_j) = \left(a(x_i \cdot x_j) + b\right)^{d} \qquad (14)$$

$$k(x_i, x_j) = \exp\left(-g\|x_i - x_j\|^2\right) \qquad (15)$$
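
As a rough illustration, the two kernels and a kernel-based
decision rule can be written as follows. The sketch assumes the
usual sign-of-kernel-expansion form of the decision function and
takes the multipliers, labels, and bias as given rather than
solving the dual problem; the parameter names and default values
are hypothetical.

```python
import numpy as np

def polynomial_kernel(xi, xj, a=1.0, b=1.0, d=2):
    """Polynomial kernel (a * <xi, xj> + b) ** d."""
    return (a * np.dot(xi, xj) + b) ** d

def rbf_kernel(xi, xj, g=0.5):
    """Radial basis function kernel exp(-g * ||xi - xj||^2)."""
    return np.exp(-g * np.sum((xi - xj) ** 2))

def decision(x, support_x, support_y, alpha, b, kernel=rbf_kernel):
    """Sign of the kernel expansion over the support vectors.

    alpha and b are assumed to come from an already solved dual problem.
    """
    s = sum(a_i * y_i * kernel(x_i, x)
            for a_i, y_i, x_i in zip(alpha, support_y, support_x))
    return 1 if s + b >= 0 else -1

# Hypothetical solution with one support vector per class.
support_x = np.array([[-1.0, 0.0], [1.0, 0.0]])
support_y = [-1, 1]
alpha = [1.0, 1.0]
label = decision(np.array([2.0, 0.0]), support_x, support_y, alpha, b=0.0)
```

Only the kernel values between the new point and the support
vectors are needed, so the mapping into the feature space never
has to be computed explicitly.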


Support vector machines differ from discriminant
analysis in two significant ways. First, the feature
space of a classification problem is not assumed to be
linearly separable. Rather, a nonlinear mapping
function (also called a kernel function) is used to
represent the data in higher dimensions, where the
boundary between classes is assumed to be linear
(Duda et al., 2001). Second, the boundary is
represented by support vectors instead of a
single boundary. Support vectors run through the
sample patterns that are the most difficult to classify,
that is, the sample patterns that are closest to the actual
boundary. Over-fitting is prevented by specifying a
maximum margin that separates the hyper-plane from
the classes. Samples that violate this margin are
penalized. The size of the penalty is a parameter often
referred to as C (Brown et al., 2000; Christianini &
Taylor, 2000).


Formulation of the Hybrid Proposed Model

Multi-layer perceptrons (MLPs) are flexible computing 
frameworks and universal approximators that can be 
applied to a wide range of classification problems with 
a high degree of accuracy (Khashei et al., 2012). 
Several distinguishing features of multi-layer 
perceptrons make them valuable and attractive for 
classification tasks. The most important of these is
that MLPs, as opposed to the traditional model-based
techniques, are data-driven self-adaptive methods in
that there are few a priori assumptions about the
models for problems under study (Khashei et al., 2009).
The parameters of MLP models (weights and biases)
are crisp ($w_{i,j}$ ($i = 0, 1, 2, \ldots, p$; $j = 1, 2, \ldots, q$), $w_j$ ($j = 0, 1, 2, \ldots, q$)).
In the proposed model, instead of crisp parameters, fuzzy
parameters in the form of triangular fuzzy numbers
are used for the related parameters of the layers
($\tilde{w}_{i,j}$ ($i = 0, 1, 2, \ldots, p$; $j = 1, 2, \ldots, q$), $\tilde{w}_j$ ($j = 0, 1, 2, \ldots, q$)). The
model is described using a fuzzy function with fuzzy
parameters (Khashei et al., 2012):

$$
\tilde{y}_t = f\left(\tilde{w}_0 + \sum_{j=1}^{q}\tilde{w}_j \cdot g\left(\tilde{w}_{0,j} + \sum_{i=1}^{p}\tilde{w}_{i,j} \cdot y_{t-i}\right)\right) \tag{16}
$$

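where $y_t$ are observations, and $\tilde{w}_j$ ($j = 0, 1, 2, \ldots, q$) and
$\tilde{w}_{i,j}$ ($i = 0, 1, 2, \ldots, p$; $j = 1, 2, \ldots, q$) are fuzzy numbers. Eq.
(16) is modified as follows:

$$
\tilde{y}_t = f\left(\tilde{w}_0 + \sum_{j=1}^{q}\tilde{w}_j \cdot \tilde{X}_{t,j}\right) = f\left(\sum_{j=0}^{q}\tilde{w}_j \cdot \tilde{X}_{t,j}\right) \tag{17}
$$

where $\tilde{X}_{t,j} = g\left(\tilde{w}_{0,j} + \sum_{i=1}^{p}\tilde{w}_{i,j} \cdot y_{t-i}\right)$. Fuzzy parameters
in the form of triangular fuzzy numbers
$\tilde{w}_{i,j} = (a_{i,j}, b_{i,j}, c_{i,j})$ are used: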


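$$
\mu_{\tilde{w}_{i,j}}(w_{i,j}) =
\begin{cases}
\dfrac{1}{b_{i,j} - a_{i,j}}\,(w_{i,j} - a_{i,j}) & \text{if } a_{i,j} \le w_{i,j} \le b_{i,j}, \\[2mm]
\dfrac{1}{b_{i,j} - c_{i,j}}\,(w_{i,j} - c_{i,j}) & \text{if } b_{i,j} \le w_{i,j} \le c_{i,j}, \\[2mm]
0 & \text{otherwise,}
\end{cases}
\tag{18}
$$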


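where $\mu_{\tilde{w}_{i,j}}(w_{i,j})$ is the membership function of the
fuzzy set that represents parameter $w_{i,j}$. By applying
the extension principle, it becomes clear that the
membership of $\tilde{X}_{t,j} = g\left(\tilde{w}_{0,j} + \sum_{i=1}^{p}\tilde{w}_{i,j} \cdot y_{t-i}\right)$ in Eq. (17) is
given as (Khashei et al., 2008):

$$
\mu_{\tilde{X}_{t,j}}(x_{t,j}) =
\begin{cases}
\dfrac{x_{t,j} - g\left(\sum_{i=0}^{p} a_{i,j}\,y_{t,i}\right)}{g\left(\sum_{i=0}^{p} b_{i,j}\,y_{t,i}\right) - g\left(\sum_{i=0}^{p} a_{i,j}\,y_{t,i}\right)} & \text{if } g\left(\sum_{i=0}^{p} a_{i,j}\,y_{t,i}\right) \le x_{t,j} \le g\left(\sum_{i=0}^{p} b_{i,j}\,y_{t,i}\right), \\[4mm]
\dfrac{x_{t,j} - g\left(\sum_{i=0}^{p} c_{i,j}\,y_{t,i}\right)}{g\left(\sum_{i=0}^{p} b_{i,j}\,y_{t,i}\right) - g\left(\sum_{i=0}^{p} c_{i,j}\,y_{t,i}\right)} & \text{if } g\left(\sum_{i=0}^{p} b_{i,j}\,y_{t,i}\right) \le x_{t,j} \le g\left(\sum_{i=0}^{p} c_{i,j}\,y_{t,i}\right), \\[4mm]
0 & \text{otherwise,}
\end{cases}
\tag{19}
$$

where $y_{t,i} = y_{t-i}$ ($t = 1, 2, \ldots, k$; $i = 1, 2, \ldots, p$), and
$y_{t,i} = 1$ ($t = 1, 2, \ldots, k$; $i = 0$). Considering the triangular fuzzy
numbers $\tilde{X}_{t,j}$ with the membership function of Eq. (19), the
triangular fuzzy parameters $\tilde{w}_j = (d_j, e_j, f_j)$ will be as follows:

$$
\mu_{\tilde{w}_j}(w_j) =
\begin{cases}
\dfrac{1}{e_j - d_j}\,(w_j - d_j) & \text{if } d_j \le w_j \le e_j, \\[2mm]
\dfrac{1}{e_j - f_j}\,(w_j - f_j) & \text{if } e_j \le w_j \le f_j, \\[2mm]
0 & \text{otherwise,}
\end{cases}
\tag{20}
$$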
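The triangular membership functions used for the fuzzy weights can be illustrated with a short sketch. The function name and the argument order (left end, peak, right end) are illustrative choices, and the sketch assumes a strictly ordered triple so that no denominator is zero.

```python
def triangular_membership(w, a, b, c):
    """Membership degree of w in the triangular fuzzy number (a, b, c).

    Assumes a < b < c: membership rises linearly from 0 at a to 1 at b,
    then falls linearly back to 0 at c, and is 0 outside [a, c].
    """
    if a <= w <= b:
        return (w - a) / (b - a)
    if b < w <= c:
        return (w - c) / (b - c)
    return 0.0
```

For example, with the hypothetical triple (1, 2, 4), the value 1.5 lies halfway up the left side and the value 3 lies halfway down the right side, so both have membership 0.5.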


The membership function of
$\tilde{y}_t = f\left(\tilde{w}_0 + \sum_{j=1}^{q}\tilde{w}_j \cdot \tilde{X}_{t,j}\right) = f\left(\sum_{j=0}^{q}\tilde{w}_j \cdot \tilde{X}_{t,j}\right)$ is given as (21).

Now, considering a threshold level h for all
membership function values of observations, the
nonlinear programming problem is given as (22).

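As a special case, and to present the simplicity and
efficiency of the model, the triangular fuzzy numbers
are considered symmetric, the output neuron transfer
function is considered to be linear, and the connection
weights between the input and hidden layers are
considered to be of a crisp form. The membership
function of $y_t$ in the special case mentioned is
transformed as follows:

$$
\mu_{\tilde{y}}(y_t) =
\begin{cases}
1 - \dfrac{\left|\,y_t - \sum_{j=0}^{q}\alpha_j\,x_{t,j}\right|}{\sum_{j=0}^{q} c_j\,|x_{t,j}|} & \text{for } x_{t,j} \ne 0, \\[4mm]
0 & \text{otherwise,}
\end{cases}
\tag{23}
$$

where $\alpha_j$ and $c_j$ denote the center and the spread of the
symmetric triangular fuzzy number $\tilde{w}_j$.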


Simultaneously, $y_t$ represents the t-th observation and
the h-level is the threshold value representing the degree
to which the model should be satisfied by all the data
points $y_1, y_2, \ldots, y_k$. The choice of the h value influences
the widths of the fuzzy parameters:

$$
\mu_{\tilde{y}}(y_t) \ge h \quad \text{for } t = 1, 2, \ldots, k \tag{24}
$$
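The special-case membership function and the h-level condition can be sketched as follows. The code assumes that each fuzzy output weight is a symmetric triangular number with a center and a spread (named `alpha` and `c` here), that `x` holds the crisp hidden-layer outputs with a leading 1 for the bias term, and that membership is clipped at zero outside the support; the function names are illustrative.

```python
def symmetric_membership(y, alpha, c, x):
    """Membership of observation y in the symmetric triangular fuzzy output.

    alpha: centers of the fuzzy output weights
    c:     spreads (half-widths) of the fuzzy output weights
    x:     crisp hidden-layer outputs (x[0] = 1 for the bias term)
    """
    center = sum(a_j * x_j for a_j, x_j in zip(alpha, x))
    spread = sum(c_j * abs(x_j) for c_j, x_j in zip(c, x))
    if spread == 0:
        return 0.0
    # Clip at zero: outside the support the membership is 0, not negative.
    return max(0.0, 1.0 - abs(y - center) / spread)


def satisfies_h_level(y, alpha, c, x, h):
    """True if the observation reaches the required membership level h."""
    return symmetric_membership(y, alpha, c, x) >= h
```

With hypothetical values `alpha = [1.0, 2.0]`, `c = [0.5, 0.25]`, and `x = [1.0, 2.0]`, the fuzzy output has center 5 and spread 1, so an observation of 5.5 has membership 0.5. Raising h therefore forces wider spreads if the same observation is to remain covered.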


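$$
\mu_{\tilde{y}}(y_t) =
\begin{cases}
-\dfrac{B_1}{2A_1} + \left[\left(\dfrac{B_1}{2A_1}\right)^2 - \dfrac{C_1 - f^{-1}(y_t)}{A_1}\right]^{1/2} & \text{if } C_1 \le f^{-1}(y_t) \le C_3, \\[4mm]
\dfrac{B_2}{2A_2} + \left[\left(\dfrac{B_2}{2A_2}\right)^2 - \dfrac{C_2 - f^{-1}(y_t)}{A_2}\right]^{1/2} & \text{if } C_3 \le f^{-1}(y_t) \le C_2, \\[4mm]
0 & \text{otherwise,}
\end{cases}
\tag{21}
$$

where $A_1$, $A_2$, $B_1$, $B_2$, $C_1$, $C_2$, and $C_3$ are coefficients
determined by the triangular parameters of Eqs. (19) and (20)
and by the inputs $y_{t,i}$.

$$
\begin{aligned}
\text{Min}\quad & \sum_{t=1}^{k}\big(\text{spread of the fuzzy output } \tilde{y}_t\big) \\
\text{Subject to}\quad & -\dfrac{B_1}{2A_1} + \left[\left(\dfrac{B_1}{2A_1}\right)^2 - \dfrac{C_1 - f^{-1}(y_t)}{A_1}\right]^{1/2} \ge h \quad \text{if } C_1 \le f^{-1}(y_t) \le C_3, \text{ for } t = 1, 2, \ldots, k, \\
& \dfrac{B_2}{2A_2} + \left[\left(\dfrac{B_2}{2A_2}\right)^2 - \dfrac{C_2 - f^{-1}(y_t)}{A_2}\right]^{1/2} \ge h \quad \text{if } C_3 \le f^{-1}(y_t) \le C_2, \text{ for } t = 1, 2, \ldots, k.
\end{aligned}
\tag{22}
$$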


The index t refers to the number of non-fuzzy data
used for constructing the model. On the other hand,
the fuzziness S included in the model is defined by:

$$
S = \sum_{j=0}^{q}\sum_{t=1}^{k} c_j\,|w_j|\,|x_{t,j}| \tag{25}
$$


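where $w_j$ is the connection weight between the output
neuron and the j-th neuron of the hidden layer, and $x_{t,j}$ is the
output value of the j-th neuron of the hidden layer at
time t. Next, the problem of finding the parameters in
the proposed method is formulated as a linear
programming problem as follows:

$$
\begin{aligned}
\text{Minimize}\quad & S = \sum_{j=0}^{q}\sum_{t=1}^{k} c_j\,|w_j|\,|x_{t,j}| \\
\text{subject to}\quad & \sum_{j=0}^{q}\alpha_j\,x_{t,j} + (1-h)\left(\sum_{j=0}^{q} c_j\,|x_{t,j}|\right) \ge y_t, \quad t = 1, 2, \ldots, k, \\
& \sum_{j=0}^{q}\alpha_j\,x_{t,j} - (1-h)\left(\sum_{j=0}^{q} c_j\,|x_{t,j}|\right) \le y_t, \quad t = 1, 2, \ldots, k, \\
& c_j \ge 0 \quad \text{for } j = 0, 1, \ldots, q.
\end{aligned}
\tag{26}
$$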
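The two families of constraints in the linear program say that every observation must lie between the lower and upper bounds of the fuzzy output at level h, which is the same as requiring a membership of at least h. A minimal sketch of that feasibility check is given below; it does not solve the linear program, and the names `alpha` (centers), `c` (spreads), `X` (rows of crisp hidden-layer outputs with a leading 1), and `y` (observations) are illustrative assumptions.

```python
def h_level_bounds(alpha, c, x, h):
    """Lower and upper bounds of the fuzzy output at membership level h."""
    center = sum(a_j * x_j for a_j, x_j in zip(alpha, x))
    half_width = (1.0 - h) * sum(c_j * abs(x_j) for c_j, x_j in zip(c, x))
    return center - half_width, center + half_width


def is_feasible(alpha, c, X, y, h):
    """Check the constraints: non-negative spreads and every y_t inside its bounds."""
    if any(c_j < 0 for c_j in c):
        return False
    for x_t, y_t in zip(X, y):
        lower, upper = h_level_bounds(alpha, c, x_t, h)
        if not (lower <= y_t <= upper):
            return False
    return True
```

A candidate set of spreads that passes `is_feasible` covers all observations at level h; the linear program then looks for the feasible spreads with the smallest total fuzziness.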


Then, when the model has outliers with a wide spread,
the data around the upper and lower bounds of the
proposed model are deleted in accordance with
Ishibuchi's recommendations. Because the model must
include all possible conditions, $c_j$ has a wide
spread when the data set includes a significant
difference or an outlying case. Ishibuchi and Tanaka
(1988) suggest that the data around the model's upper
and lower boundaries be deleted so that the fuzzy
regression model can be reformulated. A final point is
that the output of the proposed model is fuzzy and
continuous, whereas the output of our classification
problem is discrete and nonfuzzy. Therefore, in
order to apply the proposed model to classification,
certain modifications to the model need to be made.
For this purpose, each class is first assigned a
numeric value, and then the membership probability
of the output in each class is calculated as follows:


$$
P_A = \frac{\int_{-\infty}^{m} f(x)\,dx}{\int_{-\infty}^{+\infty} f(x)\,dx}, \qquad
P_B = 1 - P_A = \frac{\int_{m}^{+\infty} f(x)\,dx}{\int_{-\infty}^{+\infty} f(x)\,dx}
\tag{27}
$$


where $P_A$ and $P_B$ are the membership probabilities of
class A and class B, respectively, and m is the
mean of the class values. Finally, the sample is put in
the class for which its output has the largest
probability. In the proposed model, because the
output is fuzzy, it may be better to apply large
class values. Larger class values expand small
differences in the output, helping the model to become
more sensitive to variations in the input. For example,
instead of {-1, +1} or {0, +1}, it is better to use
{-10, +10} or {-100, +100} as class values
(Khashei et al., 2012).
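The class assignment step can be sketched under the assumption that f(x) is the triangular membership function of the fuzzy output, so that each membership probability is the share of the triangle's area on one side of m. The function names and the parameterization by left end, peak, and right end are illustrative; `class_a` is assumed to be the smaller of the two class values.

```python
def triangular_area_fraction(x, left, peak, right):
    """Fraction of a triangular membership function's area lying to the left of x."""
    if x <= left:
        return 0.0
    if x >= right:
        return 1.0
    if x <= peak:
        return (x - left) ** 2 / ((right - left) * (peak - left))
    return 1.0 - (right - x) ** 2 / ((right - left) * (right - peak))


def class_probabilities(left, peak, right, m):
    """Membership probabilities (P_A, P_B) for the classes below and above m."""
    p_a = triangular_area_fraction(m, left, peak, right)
    return p_a, 1.0 - p_a


def assign_class(left, peak, right, class_a, class_b):
    """Assign the class with the larger membership probability (ties go to class A)."""
    m = (class_a + class_b) / 2.0
    p_a, p_b = class_probabilities(left, peak, right, m)
    return class_a if p_a >= p_b else class_b
```

With class values {-10, +10} and a hypothetical fuzzy output spanning -2 to 10 with its peak at 4, only a small part of the area lies below m = 0, so the sample is assigned to the class with value +10.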

Pima Indian Diabetes Data Set 

The Pima Indian Diabetes data set was collected by the
National Institute of Diabetes and Digestive and
Kidney Diseases and consists of diabetes diagnoses
(positive or negative) and attributes of female patients
who are at least 21 years old and of Pima Indian
heritage (Golub et al. 1999). The eight attributes
represent 1) the number of times pregnant, 2) the
results of an oral glucose tolerance test, 3) diastolic
blood pressure (mm Hg), 4) triceps skin fold thickness
(mm), 5) 2-h serum insulin (μU/ml), 6) body mass
index (weight in kg/(height in m)^2), 7) diabetes
pedigree function, and 8) age (years).

FIG. 2 THE TWO-DIMENSIONAL DISTRIBUTION OF PIMA INDIAN DIABETES CLASSES

TABLE 1 BRIEF STATISTICAL INFORMATION OF ATTRIBUTES

No.   Attribute Name               Mean     Standard Deviation
1     Number of times pregnant     3.8      3.4
2     Plasma glucose (2 hours)     120.9    32
3     Diastolic blood pressure     69.1     19.4
4     Triceps skin fold thickness  20.5     16.0
5     Two-hour serum insulin       79.8     115.2
6     Body mass index              32.0     7.9
7     Diabetes pedigree function   0.5      0.3
8     Age                          33.2     11.8

The two-dimensional distribution of the two classes against
(X2, X3), (X6, X7), (X2, X6), (X3, X7), (X2, X7), and
(X3, X6), as examples, is shown in Fig. 2. Some
statistical information on the attributes is given in Table 1.
The data set consists of 768 samples, about two thirds
of which have a negative diabetes diagnosis and one
third a positive diagnosis. The data set is
randomly split into training and test sets of equal size,
384 samples each.
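The equal-size random split can be sketched as below. The paper does not state how the split was drawn, so the use of a shuffled index list and a fixed seed is an assumption made only for reproducibility of the illustration.

```python
import random


def split_half(n_samples, seed=0):
    """Randomly split sample indices into two halves (training, test)."""
    indices = list(range(n_samples))
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    rng.shuffle(indices)
    half = n_samples // 2
    return indices[:half], indices[half:]


train_idx, test_idx = split_half(768)
```

Each sample index lands in exactly one of the two sets, giving 384 training and 384 test samples for the 768-sample data set.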

Application of the Hybrid Proposed Model to 
Diabetes Classification 

In order to obtain the optimum network architecture
of the proposed model, different network architectures
are evaluated and their performance compared, based on
the concepts of multi-layer perceptron design
(Khashei & Bijari, 2011) and using pruning algorithms
in the MATLAB 7 software package.
The best-fitted network, that is, the architecture
that presented the best accuracy with the test data,
is composed of eight input, five hidden, and one output
neurons (in abbreviated form, $N^{(8-5-1)}$). Then, the
minimal fuzziness of the fuzzy parameters is determined
using Eq. (26) with h=0.

As mentioned previously, the h-level value influences
the widths of the fuzzy parameters. In this case, we
consider h=0 in order to yield parameters with
minimum width. The misclassification rate of each
model and the improvement percentages of the proposed
model in comparison with those of the other classification
models for the Pima Indian diabetes data, in both the
training and test data sets, are summarized in Table 2
and Table 3, respectively. The misclassification rate
and the improvement percentage of model (B) against
model (A) are respectively calculated as follows:

$$
\text{Misclassification Rate } (MR) = \frac{\text{No. of incorrect diagnoses}}{\text{No. of samples in the set}} \tag{28}
$$

$$
\text{Improvement Percentage} = \frac{MR(A) - MR(B)}{MR(A)} \times 100\%
$$
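Both measures are simple ratios, as the following sketch shows. The error rates used in the usage note are those reported for the test data; the count of 72 misclassified samples is a hypothetical figure chosen only because 72 out of 384 rounds to the reported 18.8%.

```python
def misclassification_rate(n_incorrect, n_samples):
    """Share of samples that were diagnosed incorrectly."""
    return n_incorrect / n_samples


def improvement_percentage(mr_a, mr_b):
    """Relative reduction (in percent) of model B's error against model A's error."""
    return (mr_a - mr_b) / mr_a * 100.0
```

For instance, `improvement_percentage(25.3, 18.8)` gives about 25.69, the improvement of the proposed model over the multi-layer perceptron on the test data. A negative value means that model B has the higher error, as happens against the support vector machine on the training data (9.9% versus 17.6%).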


Comparison with Other Models 

According to the obtained results (Tables 2 & 3), our
proposed model has the lowest error on the test
portion of the data set in comparison with the other
models used for the Pima Indian Diabetes data set,
with a misclassification rate of 18.8%. Several different
architectures of artificial neural network are designed
and examined. The best-performing architecture for a
traditional multi-layer perceptron produces a 25.3%
error rate, which the proposed model improves by 25.69%.
Linear discriminant analysis performs second best
with an error rate of 21.9%, which the proposed model
improves by 14.16%. Quadratic discriminant analysis
misclassifies 28.1% of the test samples, which the
proposed model improves by 33.10%. As K-nearest
neighbour scores can be sensitive to the relative
magnitude of different attributes, all attributes are
scaled by their z-scores before using the K-nearest
neighbour model (Antal et al., 2003). The best K-nearest
neighbour model, with K=13, has an error rate of 24.7%,
which the proposed model improves by 23.89%. The
support vector machine model with C=0 produces an
error rate of 30.0%, which the proposed model improves
by 37.33%.
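The z-score scaling applied before the K-nearest neighbour model can be sketched as follows. This is an illustrative plain-Python version and not the implementation used in the paper; the population standard deviation and the majority vote over Euclidean distances are assumptions, and in practice a query would be scaled with the means and standard deviations of the training data.

```python
import math
from collections import Counter


def zscore_columns(rows):
    """Scale every column to zero mean and unit (population) standard deviation."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(n_cols)]
    scaled = [[(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
               for j in range(n_cols)] for r in rows]
    return scaled, means, stds


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training samples closest to the query."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

After scaling, an attribute measured in hundreds (such as serum insulin) no longer dominates the distance over an attribute measured in fractions (such as the diabetes pedigree function).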


TABLE 2 PIMA INDIAN DIABETES DATA SET CLASSIFICATION RESULTS

                                                  Classification error
Model                                             Training Data   Test Data
Linear Discriminant Analysis (LDA)                26.6%           21.9%
Quadratic Discriminant Analysis (QDA)             23.7%           28.1%
K-Nearest Neighbour (KNN) [K=13]                  23.4%           24.7%
Support Vector Machines (SVM) [C=0]               9.9%            30.0%
Artificial Neural Networks (ANN) [N(8-5-1)]       18.8%           25.3%
Hybrid proposed model                             17.6%           18.8%


TABLE 3 IMPROVEMENT OF THE PROPOSED MODEL IN
COMPARISON WITH THOSE OF OTHER CLASSIFICATION
MODELS

                                                  Improvement (%)
Model                                             Training Data   Test Data
Linear Discriminant Analysis (LDA)                33.83           14.16
Quadratic Discriminant Analysis (QDA)             25.74           33.10
K-Nearest Neighbour (KNN)                         24.79           23.89
Support Vector Machines (SVM)                     -77.78          37.33
Artificial Neural Networks (ANN)                  6.38            25.69


Conclusions 

Diabetes, also called the silent killer, is a metabolic
disease characterized by high blood glucose levels,
which result when the body does not produce enough
insulin or is resistant to the effects of insulin.
Classification techniques have received considerable
attention in biological and medical applications, where
they greatly help physicians to improve their prognosis,
diagnosis, or treatment planning procedures. Both
theoretical and empirical findings have indicated that
using hybrid models or combining several models has
become a common practice to reduce the
misclassification rate, especially when the models in
combination are quite different.

In this paper, a hybrid model of multi-layer
perceptrons is proposed as an alternative classification
model using the unique soft computing advantages of
fuzzy logic. The proposed model generally consists
of five phases as follows:

i) Training the neural network using the
available information from observations

ii) Determining the minimal fuzziness using
the obtained weights and the same criterion

iii) Deleting the outliers in accordance with
Ishibuchi's recommendations

iv) Calculating the membership probability of
the output in each class

v) Assigning the output to the appropriate class by
the largest probability

Five well-known statistical and intelligent
classification models (linear discriminant analysis,
quadratic discriminant analysis, K-nearest neighbour,
support vector machines, and multi-layer
perceptrons) are used in this paper in order to show
the appropriateness and effectiveness of the proposed
model for diabetes classification. The obtained results
indicate that the proposed model is superior to all
alternative models on the test data. For binary
classification of the Pima Indian Diabetes benchmark
data set, the proposed model performs better than the
traditional multi-layer perceptrons. The improvement
varies from 6.38% to 25.69% in comparison with the
multi-layer perceptrons for the training and test data
sets. In addition, the performance of the hybrid
proposed model on the test data is better overall than
that of the support vector machine and of other
traditional classification models such as linear
discriminant analysis and quadratic discriminant
analysis.

Our proposed model does not assume the shape of the
partition, unlike linear and quadratic discriminant
analysis. In contrast to the K-nearest neighbour model,
the proposed model does not require storage of the
training data. Once the model has been trained, it
performs much faster than K-nearest neighbour does,
because it does not need to iterate through individual
training samples. The proposed model does not
require experimentation with, and final selection of, a
kernel function and a penalty parameter, as is required
by support vector machines. Our proposed model relies
solely on a training process in order to identify the
final classifier model. Finally, the proposed model
does not need a large amount of data in order to yield
accurate results, unlike traditional multi-layer perceptrons.